Assignment 3
The plotting library I chose is Altair, for two reasons:
- Interactive plots created with ipywidgets cannot be displayed in fastpages, as shown in the forum (link here), and it is also not easy to put plotly figures in a fastpages blog.
- Altair offers a variety of interactive options, such as sliders and dropdowns, which make the plots more engaging.
import pandas as pd
import altair as alt
from vega_datasets import data
Task 1
The first dataset records malaria deaths by country, for all ages, across the world and over time. The entity column is the full country name and the code column is the ISO3166 code. The head of the data is shown below. Since the data covers both countries and years, the first plot to consider is a map with a time slider.
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths.csv')
df.head()
To draw a map in Altair, we need to attach a country code to the original dataset so that each row can be matched to a country on the world map. The country_info dataset contains full information about each country, including the FIFA code, the ISO3166 codes, etc.
country_info = pd.read_csv("https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv",dtype = 'str')
country_info.head()
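Because the file is read with dtype = 'str', every column, including the numeric ISO code, stays text and keeps any leading zeros. The sketch below illustrates this on a two-row inline CSV (a made-up stand-in for the real file) and shows one possible way to get integers, should a later join need integer keys. Whether the map lookup actually requires that is an assumption, not something established here.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for country-codes.csv (not the real file).
csv_text = "ISO3166-1-Alpha-3,ISO3166-1-numeric\nAFG,004\nKEN,404\n"

# Reading everything as text preserves the zero padding.
as_text = pd.read_csv(io.StringIO(csv_text), dtype="str")
text_codes = as_text["ISO3166-1-numeric"].tolist()

# Letting pandas infer the type parses the same column as integers.
as_default = pd.read_csv(io.StringIO(csv_text))
inferred_codes = as_default["ISO3166-1-numeric"].tolist()

# Converting the text column explicitly gives the same integers.
int_codes = pd.to_numeric(as_text["ISO3166-1-numeric"]).astype(int).tolist()

print(text_codes, inferred_codes, int_codes)
```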
We can transform the original malaria death dataset and keep only the countries that appear in the country_info dataset.
df.columns = ['name', 'ISO3166-1-Alpha-3','Year','Death_Rate'] # Rename the columns so that both datasets share the same key column name.
df_new = df[df['ISO3166-1-Alpha-3'].isin(country_info['ISO3166-1-Alpha-3'])] # Keep only the rows whose code exists in the country_info dataset
df_new.head()
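It can help to check which rows the isin filter removes, since entities without a matching ISO code (for example aggregate regions) drop out silently. The sketch below uses tiny made-up frames; the names `deaths`, `codes`, and the values in them are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (not the real data).
deaths = pd.DataFrame({
    "name": ["Kenya", "Sub-Saharan Africa", "Wales", "India"],
    "ISO3166-1-Alpha-3": ["KEN", None, None, "IND"],
    "Year": [1990, 1990, 1990, 1990],
    "Death_Rate": [60.1, 120.5, 0.0, 10.2],
})
codes = pd.DataFrame({"ISO3166-1-Alpha-3": ["KEN", "IND", "FRA"]})

# Boolean mask: True where the row's code is present in the reference table.
keep = deaths["ISO3166-1-Alpha-3"].isin(codes["ISO3166-1-Alpha-3"])
kept = deaths[keep]

# Names of the entities that the filter would discard.
dropped_names = sorted(deaths.loc[~keep, "name"].unique())
print(dropped_names)
```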
Merge the two datasets and drop the irrelevant columns.
df_final = pd.merge(df_new, country_info , on = 'ISO3166-1-Alpha-3', how = 'left')
df_final = df_final[['name','Year','Death_Rate','ISO3166-1-numeric']]
df_final.head()
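A left merge like this one can be made to check itself. As a sketch on made-up data (the frames `left` and `right` are hypothetical), pandas' `validate` argument raises an error if a country code appears more than once in the lookup table, and `indicator=True` adds a `_merge` column that shows whether every row found a match.

```python
import pandas as pd

# Hypothetical miniature tables that mirror the column names used above.
left = pd.DataFrame({
    "name": ["Kenya", "Kenya", "India"],
    "ISO3166-1-Alpha-3": ["KEN", "KEN", "IND"],
    "Year": [1990, 1991, 1990],
    "Death_Rate": [60.1, 58.7, 10.2],
})
right = pd.DataFrame({
    "ISO3166-1-Alpha-3": ["KEN", "IND"],
    "ISO3166-1-numeric": ["404", "356"],
})

merged = pd.merge(
    left, right,
    on="ISO3166-1-Alpha-3",
    how="left",
    validate="many_to_one",  # fail loudly if a code is duplicated on the right
    indicator=True,          # adds a "_merge" column: both / left_only
)

# Count rows that did not find a partner in the lookup table.
unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")
print(unmatched, len(merged))
```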
Set the slider
alt.data_transformers.disable_max_rows() # Altair accepts at most 5000 rows by default; disable the limit when the data has more rows than that
countries = alt.topo_feature(data.world_110m.url, 'countries')
# Set the slider: step = 1, min year is 1990, max year is 2016
slider = alt.binding_range(
step=1,
min=1990,
max=2016
)
select_date = alt.selection_single(
name="Slider",
fields=['Year'],
bind=slider,
)
alt.Chart(df_final).mark_geoshape()\
.encode(color='Death_Rate:Q')\
.add_selection(select_date)\
.transform_filter(select_date)\
.transform_lookup(
lookup='ISO3166-1-numeric',
from_=alt.LookupData(countries, key='id', fields=["type", "properties", "geometry"])
)\
.project('equirectangular')\
.properties(
width=400,
height=300,
title='Malaria Death Rate (per 100,000 people) '
)
From the map above, we can see that most countries have a death rate below 50 per 100,000 people, that countries with a high malaria death rate are mostly in Africa, and that the death rate keeps decreasing over time. However, one problem with this plot is that it is difficult to distinguish individual countries, so we cannot follow the trend of a single country. Another plot we can try is a line chart with the country as a dropdown list.
df_final['Year'] = pd.to_datetime(df_final['Year'], format='%Y') #Change the year to datetime type
country_list = df_final['name'].dropna().unique()
country_list = country_list.tolist()
dropdown = alt.binding_select(
options = country_list
)
select_country = alt.selection_single(
name="dropdown",
fields=['name'],
bind = dropdown,
)
alt.Chart(df_final).mark_line().encode(
x='Year',
y='Death_Rate',
).add_selection(
select_country
).transform_filter(
select_country
).properties(
width = 500,
height=400,
title='Malaria Death Rate for a single country (per 100,000 people)'
).configure_axis(
grid=False
)
From the line plot, we can see how the death rate of each country changes over time.
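The per-country trend that the line chart shows one country at a time can also be summarized in a table. A minimal sketch, assuming a frame with the same name, Year and Death_Rate columns; the countries "A" and "B" and their values are made up.

```python
import pandas as pd

# Hypothetical toy data; "A" and "B" are placeholder country names.
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 2000, 2016, 1990, 2000, 2016],
    "Death_Rate": [120.0, 100.0, 80.0, 10.0, 12.0, 5.0],
})

# Sort so that "first" and "last" refer to the earliest and latest year.
toy = toy.sort_values(["name", "Year"])
first_last = toy.groupby("name")["Death_Rate"].agg(["first", "last"])

# Negative values mean the rate fell between the first and last year.
first_last["change"] = first_last["last"] - first_last["first"]
print(first_last)
```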
df1 = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_inc.csv')
df1.head()
We can transform the original malaria incidence dataset in the same way and keep only the countries that appear in the country_info dataset.
df1.columns = ['name', 'ISO3166-1-Alpha-3','Year','Incidence_Rate']
df1_new = df1[df1['ISO3166-1-Alpha-3'].isin(country_info['ISO3166-1-Alpha-3'])]
df1_new.head()
df1_final = pd.merge(df1_new, country_info , on = 'ISO3166-1-Alpha-3', how = 'left')
df1_final = df1_final[['name','Year','Incidence_Rate','ISO3166-1-numeric']]
df1_final.head()
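The rename, filter and merge steps here repeat those applied to the deaths data, so they could be wrapped in a small helper. This is only a sketch: `prepare_malaria` is a hypothetical name, and the toy frames `raw` and `info` stand in for the real CSV files.

```python
import pandas as pd

KEY = "ISO3166-1-Alpha-3"

def prepare_malaria(raw, value_name, country_info):
    """Rename columns, keep rows with a known ISO code, attach the numeric code."""
    out = raw.copy()
    # Assumes four columns in the order: entity, code, year, value.
    out.columns = ["name", KEY, "Year", value_name]
    # Keep only rows whose code is present in the lookup table.
    out = out[out[KEY].isin(country_info[KEY])]
    out = pd.merge(out, country_info, on=KEY, how="left")
    return out[["name", "Year", value_name, "ISO3166-1-numeric"]]

# Hypothetical toy inputs (values are made up).
raw = pd.DataFrame({
    "Entity": ["Kenya", "World"],
    "Code": ["KEN", "OWID_WRL"],
    "Year": [2000, 2000],
    "Incidence": [250.0, 150.0],
})
info = pd.DataFrame({KEY: ["KEN"], "ISO3166-1-numeric": ["404"]})

result = prepare_malaria(raw, "Incidence_Rate", info)
print(result)
```

The same call with "Death_Rate" as value_name would cover the first dataset; the age-group file has a different column layout and would need its own variant.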
alt.data_transformers.disable_max_rows()
countries = alt.topo_feature(data.world_110m.url, 'countries')
slider1 = alt.binding_range(
step=5,
min=2000,
max=2015
)
select_date1 = alt.selection_single(
name="slider",
fields=['Year'],
bind=slider1,
)
alt.Chart(df1_final).mark_geoshape()\
.encode(color='Incidence_Rate:Q')\
.add_selection(select_date1)\
.transform_filter(select_date1)\
.transform_lookup(
lookup='ISO3166-1-numeric',
from_=alt.LookupData(countries, key='id', fields=["type", "properties", "geometry"])
)\
.project('equirectangular')\
.properties(
width=400,
height=300,
title='Malaria Incidence Rate (per 1,000 people) '
)
df1_final['Year'] = pd.to_datetime(df1_final['Year'], format='%Y') #Change the year to datetime type
country_list1 = df1_final['name'].dropna().unique()
country_list1 = country_list1.tolist()
dropdown1 = alt.binding_select(
options = country_list1
)
select_country1 = alt.selection_single(
name="dropdown",
fields=['name'],
bind = dropdown1,
)
alt.Chart(df1_final).mark_line().encode(
x='Year',
y='Incidence_Rate',
).add_selection(
select_country1
).transform_filter(
select_country1
).properties(
width = 500,
height=400,
title='Malaria Incidence Rate for a single country (per 1,000 people)'
).configure_axis(
grid=False
)
df2 = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths_age.csv')
df2.head()
df2.columns = ['index','name', 'ISO3166-1-Alpha-3','Year','Age_Group','Death_Rate']
df2_new = df2[df2['ISO3166-1-Alpha-3'].isin(country_info['ISO3166-1-Alpha-3'])]
df2_new.head()
df2_final = pd.merge(df2_new, country_info , on = 'ISO3166-1-Alpha-3', how = 'left')
df2_final = df2_final[['name','Year','Death_Rate','ISO3166-1-numeric','Age_Group']]
df2_final.head()
The plot I chose is a line plot with a country dropdown, where each color represents a different age group. Based on this plot, we can see the trend in the death rate for each age group in different countries.
df2_final['Year'] = pd.to_datetime(df2_final['Year'], format='%Y')
country_list2 = df2_final['name'].dropna().unique()
country_list2 = country_list2.tolist()
dropdown2 = alt.binding_select(
options = country_list2
)
select_country2 = alt.selection_single(
name="dropdown",
fields=['name'],
bind = dropdown2,
)
alt.Chart(df2_final).mark_line().encode(
x='Year',
y='Death_Rate',
color = 'Age_Group'
).add_selection(
select_country2
).transform_filter(select_country2)
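To read the age-group comparison as numbers rather than colors, the data for one country can be pivoted into a table with one column per age group. A minimal sketch on made-up values; the age-group labels and figures are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical toy data for a single country.
toy = pd.DataFrame({
    "name": ["Kenya"] * 4,
    "Year": [1990, 1990, 2016, 2016],
    "Age_Group": ["Under 5", "5-14", "Under 5", "5-14"],
    "Death_Rate": [300.0, 20.0, 150.0, 8.0],
})

# One row per year, one column per age group.
one_country = toy[toy["name"] == "Kenya"]
wide = one_country.pivot(index="Year", columns="Age_Group", values="Death_Rate")
print(wide)
```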